ABSTRACT

Four Monte Carlo simulation studies of Owen's Bayesian sequential
procedure for adaptive mental testing were conducted. Whereas previous
simulation studies of this procedure have concentrated on evaluating it
in terms of the correlation of its test scores with simulated ability in
a normal population, these four studies explored a number of additional
properties, both in a normally distributed population and in a
distribution-free context. Study 1 replicated previous studies with
finite item pools, but examined such properties as the bias of estimate,
mean absolute error, and correlation of test length with ability.
Studies 2 and 3 examined the same variables in a number of hypothetical
infinite item pools, investigating the effects of item discriminating
power, guessing, and variable vs. fixed test length. Study 4
investigated some properties of the Bayesian test scores as latent trait
estimators, under three different configurations (regressions of item
discrimination on item difficulty) of item pools. The properties of
interest included the regression of latent trait estimates on actual
trait levels, the conditional bias of such estimates, the information
curve of the trait estimates, and the relationship of test length to
ability level. The results of these studies indicated that the ability
estimates derived from the Bayesian test strategy were highly correlated
with ability level. However, the ability estimates were also highly
correlated with number of items administered, were nonlinearly biased,
and provided measurements which were not of equal precision at all
levels of ability. (Author)
SUPPLEMENTARY NOTES

Portions of this paper were presented at the Spring 1975 meeting of the
Psychometric Society, Iowa City, Iowa, April 24, 1975, and the Conference on
Computerized Adaptive Testing, Washington, D.C., June 12, 1975.


testing sequential testing programmed testing 

ability testing branched testing response-contingent testing 

computerized testing individualized testing automated testing

adaptive testing tailored testing
Some Properties of a Bayesian
Adaptive Ability Testing Strategy 



Adaptive or tailored ability testing subsumes a number of different 
strategies for adapting the difficulty of test items to the examinee's 
ability level. All the adaptive testing strategies have as one objective 
the improvement of the psychometric properties of mental test scores 
throughout the range of the trait of interest (e.g., ability). This is 
accomplished by adapting test item difficulty to each examinee's ability, 
during the test itself. Ideally the adaptive selection and administration 
of test items would result in each examinee answering only those items 
which are most informative for his own ability level. Additionally, where 
items can be answered correctly by random guessing (e.g., multiple-choice
items), an optimally efficient adaptive item selection technique would
have the effect of equalizing the effect of guessing on test scores
throughout the ability range.

The different item selection techniques of the various adaptive
testing strategies have been described by Weiss (1974). One of the most
elegant of the adaptive strategies is a Bayesian sequential technique
proposed by Owen (1969, 1975) and studied empirically by several
investigators including Wood (1971), Urry (1971) and Jensema (1972).

Owen's Bayesian Sequential Adaptive Testing Strategy

Owen's technique is a general one for the sequential design and
analysis of independent experiments with a dichotomous response. Its
application in mental testing is to the problem of estimating ability by
means of sequential selection, administration, and scoring of dichotomous
test items. The mathematical details of the method arise from latent trait
theory, with the item characteristic curves all assumed to take the form
of the normal ogive. The properties of the normal ogive item characteristic
function and its logistic approximation have been described by Lord &
Novick (1968) and Birnbaum (1968), respectively.

Owen's procedure involves the individually tailored sequential design
of a test by appropriate choice of available item parameters,¹ and
estimation of ability ($\theta$) via a Bayesian-motivated approximation.
At each step $m$ in the ability estimation sequence a normal prior
distribution on $\theta$ is assumed, with parameters $\mu_m$ and
$\sigma_m^2$, where $m$ indicates the number of items already administered
in the sequence. A test item to be administered at step $m+1$ is selected
so as to minimize a quadratic loss function on $\theta$. With no guessing
(i.e., $c_g = 0$) and the discrimination parameters $a_g$ constant over
items, the appropriate item is the available one which minimizes the
absolute value of the difference $(b_g - \mu_m)$. With $c_g > 0$ the optimal
difference is somewhat negative; that is, optimal difficulty is somewhat
"easier" than the examinee's ability.

¹Each item $g$ can be characterized by three parameters, $a_g$, $b_g$,
and $c_g$, which are, respectively, the item discriminating power, item
difficulty, and item guessing parameter. The guessing parameter, $c_g$, is
simply the probability of answering the item correctly by chance alone.

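Following item administration at step $m+1$, the parameters $\mu_m$ and
$\sigma_m^2$ of the prior distribution are updated in accord with the
examinee's performance on the item. In the case of a correct answer:

$$\mu_{m+1} = E(\theta \mid 1) = \mu_m + (1 - c_g)\,\frac{\sigma_m^2}{\sqrt{1/a_g^2 + \sigma_m^2}}\cdot\frac{\phi(D)}{A} \tag{1}$$

and

$$\sigma_{m+1}^2 = \operatorname{var}(\theta \mid 1) = \sigma_m^2\left\{1 - \frac{1 - c_g}{1 + 1/(a_g^2\sigma_m^2)}\cdot\frac{\phi(D)}{A}\left[\frac{(1 - c_g)\,\phi(D)}{A} - D\right]\right\} \tag{2}$$

Following a wrong answer:

$$\mu_{m+1} = E(\theta \mid 0) = \mu_m - \frac{\sigma_m^2}{\sqrt{1/a_g^2 + \sigma_m^2}}\cdot\frac{\phi(D)}{\Phi(D)} \tag{3}$$

and

$$\sigma_{m+1}^2 = \operatorname{var}(\theta \mid 0) = \sigma_m^2\left\{1 - \frac{1}{1 + 1/(a_g^2\sigma_m^2)}\cdot\frac{\phi(D)}{\Phi(D)}\left[\frac{\phi(D)}{\Phi(D)} + D\right]\right\} \tag{4}$$

In Equations 1 through 4 (taken from Owen, 1975)

$\phi(D)$ is the normal probability density function,

$\Phi(D)$ is the cumulative normal distribution function, and

$$D = \frac{b_g - \mu_m}{\sqrt{1/a_g^2 + \sigma_m^2}} \tag{5}$$

$$A = c_g + (1 - c_g)\,\Phi(-D) \tag{6}$$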
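The update step can be sketched in code. The following is a minimal illustration, not the authors' program: it assumes a normal-ogive item with guessing and a normal prior, and computes the posterior mean and variance after one response, with $D$ and $A$ defined as in Equations 5 and 6. The function names (`owen_update`, `norm_pdf`, `norm_cdf`) are hypothetical.

```python
import math


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def owen_update(mu, var, a, b, c, correct):
    """Return (posterior mean, posterior variance) after one item response.

    mu, var : parameters of the current normal prior on ability
    a, b, c : item discrimination, difficulty, and guessing parameters
    correct : True for a right answer, False for a wrong answer
    """
    s = math.sqrt(1.0 / a ** 2 + var)
    D = (b - mu) / s                      # standardized difficulty offset
    A = c + (1.0 - c) * norm_cdf(-D)      # marginal probability of a correct answer
    shrink = var / (1.0 / a ** 2 + var)   # same as 1 / (1 + 1 / (a^2 var))
    if correct:
        ratio = (1.0 - c) * norm_pdf(D) / A
        new_mu = mu + (var / s) * ratio
        new_var = var * (1.0 - shrink * ratio * (ratio - D))
    else:
        ratio = norm_pdf(D) / norm_cdf(D)
        new_mu = mu - (var / s) * ratio
        new_var = var * (1.0 - shrink * ratio * (ratio + D))
    return new_mu, new_var


# One step from the [0,1] prior with an item of difficulty 0 and guessing .20.
print(owen_update(0.0, 1.0, 1.6, 0.0, 0.20, True))
print(owen_update(0.0, 1.0, 1.6, 0.0, 0.20, False))
```

With guessing present, a correct answer moves the estimate up by less than a wrong answer moves it down, because a correct answer may have been a lucky guess.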
The parameters $\mu_{m+1}$ and $\sigma_{m+1}^2$ of the Bayes posterior
distribution on $\theta$ are used as the parameters of the next step's
prior. At each step the prior distribution is assumed to be normal. Testing
may be terminated when $\sigma_m^2$ becomes arbitrarily small or when $m$
becomes arbitrarily large, or when some other criterion has been reached.
At termination the latest $\mu_m$ is the estimator of $\theta$, and
$\sigma_m^2$ is a measure of the uncertainty of the estimate. Urry (1971)
and Jensema (1972, 1974) have interpreted $\sigma_m^2$ as the squared
standard error of estimate (S.E.E.) of $\hat{\theta}$. Owen (1975) gives a
theorem showing that as $m \to \infty$, $\mu_m \to \theta$; that is, the
posterior mean is a consistent estimator of an examinee's ability.

Practically speaking, of course, the number of items administered will
never approach infinity; but if the pool of available items is sufficiently
large and appropriately constituted, $\sigma_m^2$ will diminish rapidly,
permitting valid estimation of $\theta$ using a small number of items. Urry
(1971, 1974) has specified the requirements for a satisfactory item pool for
implementing Owen's testing procedure and has shown in computer simulation
studies that Owen's sequential test can achieve in 3 to 30 items the validity
of a much longer conventional test, with the number of items needed
diminishing as item discriminatory power increased.

Urry's (1971, 1974) and Jensema's (1972, 1974) Monte Carlo simulation
studies of Owen's Bayesian testing strategy have evaluated its merit solely
in terms of the "fidelity" (or "validity")² of the resulting ability estimates
and the mean number of items required to achieve any specified value of the
fidelity coefficient. Although the fidelity coefficient is of great interest,
Lord (1970, p. 152) has pointed out that evaluating an adaptive test by
means of a group statistic such as the correlation coefficient presumes some
knowledge of the group's distribution on the trait being measured, and
ignores information relevant to the accuracy or goodness of the ability
estimates at any given level of the trait.

The correlation of test scores with the simulated underlying ability is
only one criterion by which to evaluate a proposed adaptive testing strategy.
Since the Bayesian sequential test scores are actually estimates of underlying
trait level, in the same metric, the accuracy of the estimates is also of
interest. "Accuracy" refers to the closeness of the estimates to actual
ability; it may vary systematically with ability level. Another interesting
property of estimates is bias, or error of central tendency. Two kinds of
bias should be of some concern: 1) unconditional bias, or group mean error
of estimate; and 2) conditional bias, or mean error of estimate at a given
level of the parameter being estimated.
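The distinction between the two kinds of bias can be made concrete with a small sketch. The data below are hypothetical values chosen only for illustration, and the function names are not from the original report. Unconditional bias averages the errors over the whole group; conditional bias averages them separately at each ability level.

```python
from collections import defaultdict


def unconditional_bias(theta, theta_hat):
    """Group mean of the algebraic errors (estimate minus true ability)."""
    errors = [est - true for est, true in zip(theta_hat, theta)]
    return sum(errors) / len(errors)


def conditional_bias(theta, theta_hat):
    """Mean algebraic error at each distinct true ability level."""
    by_level = defaultdict(list)
    for true, est in zip(theta, theta_hat):
        by_level[true].append(est - true)
    return {level: sum(errs) / len(errs)
            for level, errs in sorted(by_level.items())}


# Hypothetical data: two examinees at ability -1 and two at ability +1.
theta = [-1.0, -1.0, 1.0, 1.0]
theta_hat = [-0.8, -0.6, 0.7, 0.9]

print(unconditional_bias(theta, theta_hat))
print(conditional_bias(theta, theta_hat))
```

In this made-up example the group bias is small even though low abilities are overestimated and high abilities underestimated, which is why a group statistic alone can hide bias at particular trait levels.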



²By "validity" here is meant the correlation of the ability estimates with
actual ability. Green (1975) suggested use of the term "fidelity" in this
context to denote validity coefficients obtained from Monte Carlo simulation
studies. Green's convention will be followed here.
Purpose

The purpose of the present paper is to report the results of a 
series of simulation studies designed to investigate the influence of 
guessing and item pool characteristics on the bias, accuracy, and other 
properties of the trait estimates derived from Owen's Bayesian sequential
testing strategy. 

The studies reported below were motivated by results obtained with 
live testing of Owen's strategy. Using Owen's testing strategy with 603 
college students and a 329-item pool of vocabulary knowledge test items, a
correlation of .84 was obtained between estimated ability level and number
of test items to termination. Simulation studies then were designed to
investigate the influence of item pool characteristics on that unexpectedly 
large correlation. 

The simulation studies reported here were intended to explore both 
the properties of the Bayesian sequential testing method itself and 
properties of the resulting ability estimates. The former properties are 
investigated best by sampling from "populations" of simulated examinees
whose distribution on the ability dimension approximates in form and
parameters (mean, variance) the population assumed by the testing procedure:
here, a normal population with mean 0 and variance 1. The first three
studies reported sampled examinees from such a population. These studies
were designed to investigate the effects of guessing, of item discriminating
power, and of two different test termination criteria on certain group
statistics. The independent and dependent variables of interest in each
study are described separately below.

The fourth study focused on certain properties of the test scores 
as estimators of the ability underlying the item responses under varying 
conditions. This area of inquiry required sampling large numbers of 
examinees at regular intervals throughout the normal range of the trait. 
The details of this study are likewise described separately below.



Study 1: An Ideal Item Pool with Variable Test Length 

Background and Purpose 

Jensema (1972) simulated Bayesian test administration to examinees
sampled from a normal [0,1] distribution using two different "ideal"
100-item pools. These pools were "ideal" according to Jensema's prescription
that items for use in this testing strategy should have high discriminations
and should be rectangularly distributed in their difficulties. The first
pool had four items available at each of twenty-five equally spaced
difficulty levels in the interval $-2.4 \le b \le 2.4$; all items had guessing
parameters of $c = .20$ and discriminations of $a = .80$. A second item pool
was identical to the first except for the value of the constant discrimination
parameter, which was $a = 1.60$. The Bayesian test was simulated as proposed
by Owen (1969), with the parameters of the initial ability distribution set
at [0,1] for each examinee. Testing terminated for each examinee whenever
the posterior variance of the ability estimate diminished below a
predetermined value or after thirty items, whichever occurred first.
Jensema set the critical posterior variance value at .0625, which corresponds
to a standard error of estimate of .25, and hence to a fidelity
coefficient exceeding .968 (Jensema, 1972, p. 114). Jensema's obtained
fidelity coefficients and mean test lengths, obtained from simulations
using random samples of 100 examinees, are listed in Table 1.
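The arithmetic linking the variance criterion to the fidelity figure can be written out. This sketch assumes a unit-variance ability distribution and uses the usual regression identity between a standard error of estimate and a correlation; whether Jensema's own derivation took exactly this form is an assumption.

```latex
\text{S.E.E.} = \sqrt{\sigma_m^2} = \sqrt{.0625} = .25,
\qquad
r_{\theta\hat{\theta}} = \sqrt{1 - \text{S.E.E.}^2}
                       = \sqrt{1 - .0625}
                       = \sqrt{.9375} \approx .968
```

Any test that stops with a posterior variance below .0625 would therefore be expected, under these assumptions, to show a fidelity coefficient above .968.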



Table 1

Mean Test Lengths and Obtained Fidelity Coefficients for
Two Simulated Bayesian Sequential Tests,
Distinguished by Their Item Discriminating Power (a)
(from Jensema, 1972)

             Mean           Fidelity
   a      Test Length     Coefficient

  .80        30*              .93
 1.60        17.5             .97

*No tests achieved the posterior variance termination
criterion in this condition.



Jensema (1972) did not report, however, some properties of the Bayesian
sequential testing procedures which are of practical interest. The
purpose of the present study was to replicate Jensema's research with
these same two "ideal" item pools, while studying some other properties
of the ability estimates in addition to fidelity and mean test lengths.

Method 

Variables. Dependent variables were the individual ability estimates
($\hat{\theta}$) and the number of items ($k$) required to satisfy the
posterior variance termination criterion of $\sigma_m^2 < .0625$.
Independent variables were the simulated examinees' abilities ($\theta$)
and the discriminating power ($a = .80$ or $1.60$) of the items in the
simulated item pool.

Examinees' abilities were simulated by computer-generation of 100
random numbers ($\theta_i$) from a normal population with mean 0 and
variance 1. The same 100 "examinees" were tested with both item pools.

Item pools. Two 100-item "ideal" item pools were simulated,
corresponding to the ones used by Jensema (1972). In each pool there were
four items at each of twenty-five difficulty levels ($b$) equally spaced
in the interval $[-2.4 \le b \le +2.4]$. The guessing parameter ($c$) was
constant across items; for both pools, $c = .20$. The item pool for the first
test had a constant discrimination parameter of $a = .80$ across items; the
second pool employed a constant item discrimination parameter equal to
$a = 1.60$.
Thus, for each test administration an item pool containing 100 distinct
items was simulated; each item $g$ could be characterized by its parameters
$a_g$, $b_g$, $c_g$.
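A pool of this shape is easy to build programmatically. The sketch below is illustrative only; the function name `make_ideal_pool` and the dictionary layout are assumptions, not taken from the original simulation program.

```python
def make_ideal_pool(a, c=0.20, n_levels=25, per_level=4,
                    b_min=-2.4, b_max=2.4):
    """Build a pool with `per_level` items at each of `n_levels`
    equally spaced difficulty values between b_min and b_max."""
    step = (b_max - b_min) / (n_levels - 1)
    pool = []
    for level in range(n_levels):
        b = round(b_min + level * step, 10)   # round away float noise
        for _ in range(per_level):
            pool.append({"a": a, "b": b, "c": c})
    return pool


pool_low = make_ideal_pool(a=0.80)    # less discriminating pool
pool_high = make_ideal_pool(a=1.60)   # more discriminating pool
print(len(pool_low), pool_low[0], pool_low[-1])
```

With 25 levels spanning -2.4 to +2.4, adjacent difficulty levels are 0.2 apart.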

Response generation and test administration. Item responses were
simulated by calculating, for each item-examinee administration, the
probability of a correct response to the item given the simulated ability
($\theta_i$) and the item parameters $a_g$, $b_g$, $c_g$, using equations
presented by Betz & Weiss (1974) and Vale & Weiss (1975). This probability
$P_g(\theta_i)$ was compared with a random number $r_{gi}$ generated from a
uniform distribution in the interval [0,1]. A score of 1 ("correct") for
examinee $i$ on item $g$ was assigned if $P_g(\theta_i) \ge r_{gi}$; otherwise
a score of 0 was assigned.
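The response-generation step can be sketched as follows. The exact equations of Betz & Weiss (1974) and Vale & Weiss (1975) are not reproduced in this section, so the sketch assumes the normal-ogive model with guessing described earlier; the function names are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def p_correct(theta, a, b, c):
    """Assumed response model: normal ogive with a guessing floor c."""
    return c + (1.0 - c) * norm_cdf(a * (theta - b))


def simulate_response(theta, a, b, c, rng):
    """Score 1 if the model probability is at least a uniform draw, else 0."""
    r = rng.random()   # uniform random number in [0, 1)
    return 1 if p_correct(theta, a, b, c) >= r else 0


rng = random.Random(1975)   # seeded only so the example is repeatable
print(p_correct(0.0, 1.6, 0.0, 0.20))
print([simulate_response(0.0, 1.6, 0.0, 0.20, rng) for _ in range(10)])
```

When ability equals item difficulty and the guessing parameter is .20, the assumed model gives a success probability of .60 rather than .50.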

Test administration was simulated exactly as proposed by Owen
(1969). For each examinee an initial ability $\hat{\theta}_0 = 0$ was assumed,
and the prior distribution was assumed to be normal [0,1]. The optimal item
in the pool was selected based on the item parameters, and its administration
to the examinee was simulated. Based on the item score (1 or 0), the
parameters ($\mu_m$, $\sigma_m^2$) were updated, and another item was selected
and administered. This recursive procedure was repeated until 30 items
had been taken by the "examinee", or until $\sigma_m^2$ was smaller than
.0625, whichever occurred first. Once any particular item had been taken by
the examinee it was not reused. At test termination, the examinee's
simulated ability ($\theta_i$), the Bayesian estimate ($\hat{\theta}_i$), and
the number of items taken ($k$) were recorded.
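The whole administration loop can be sketched for a single simulated examinee. This is an illustration under stated assumptions, not the authors' program: items are `(a, b, c)` tuples, responses follow a normal-ogive model with guessing, and "optimal" is taken to mean the unused item with the smallest expected posterior variance, which is the usual reading of a quadratic loss criterion. All function names are hypothetical.

```python
import math
import random


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def owen_update(mu, var, a, b, c, correct):
    """Posterior mean and variance after one response (normal prior)."""
    s = math.sqrt(1.0 / a ** 2 + var)
    D = (b - mu) / s
    A = c + (1.0 - c) * norm_cdf(-D)
    shrink = var / (1.0 / a ** 2 + var)
    if correct:
        ratio = (1.0 - c) * norm_pdf(D) / A
        return mu + (var / s) * ratio, var * (1.0 - shrink * ratio * (ratio - D))
    ratio = norm_pdf(D) / norm_cdf(D)
    return mu - (var / s) * ratio, var * (1.0 - shrink * ratio * (ratio + D))


def expected_posterior_variance(mu, var, a, b, c):
    """Average of the two possible posterior variances, weighted by the
    marginal probability of each response (assumed selection criterion)."""
    s = math.sqrt(1.0 / a ** 2 + var)
    p_right = c + (1.0 - c) * norm_cdf((mu - b) / s)
    v_right = owen_update(mu, var, a, b, c, True)[1]
    v_wrong = owen_update(mu, var, a, b, c, False)[1]
    return p_right * v_right + (1.0 - p_right) * v_wrong


def administer_test(theta, pool, rng, max_items=30, var_stop=0.0625):
    """Simulate one adaptive test; return (estimate, posterior variance, k)."""
    mu, var, k = 0.0, 1.0, 0              # normal [0,1] prior
    available = list(pool)
    while k < max_items and var >= var_stop and available:
        item = min(available,
                   key=lambda it: expected_posterior_variance(mu, var, *it))
        available.remove(item)            # an item is never reused
        a, b, c = item
        p = c + (1.0 - c) * norm_cdf(a * (theta - b))
        correct = p >= rng.random()
        mu, var = owen_update(mu, var, a, b, c, correct)
        k += 1
    return mu, var, k


# 100-item pool: 4 items at each of 25 difficulty levels, a = 1.60, c = .20.
pool = [(1.6, round(-2.4 + 0.2 * j, 1), 0.20)
        for j in range(25) for _ in range(4)]
print(administer_test(1.0, pool, random.Random(7)))
```

Repeating `administer_test` over 100 normally distributed ability values would produce one simulated test administration of the kind described above.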

Evaluative criteria. For each of the two test administrations,
after all 100 examinees' tests were simulated, the following properties
of the sequential test were estimated from the data:

a. the bias, or mean algebraic error of the ability estimates,
   $\frac{1}{100}\sum_{i=1}^{100}(\hat{\theta}_i - \theta_i)$;

b. the accuracy, or mean absolute error of the estimates,
   $\frac{1}{100}\sum_{i=1}^{100}|\hat{\theta}_i - \theta_i|$;

c. $r_{\theta k}$, the correlation of test length with ability;

   $r_{\hat{\theta} k}$, the correlation of test length with estimated ability;

d. $r_{e\theta}$, the correlation of the algebraic errors of estimate
   $(\hat{\theta}_i - \theta_i)$ with ability;

   $r_{e\hat{\theta}}$, the correlation of $(\hat{\theta}_i - \theta_i)$ with
   estimated ability;

e. $r_{\theta\hat{\theta}}$, the fidelity coefficient;

f. the mean, minimum and maximum test length required to achieve
   the posterior variance termination criterion.
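The criteria listed above can be computed from the three recorded values per examinee. The sketch below is a minimal illustration using only the standard library; the function and key names are hypothetical, and a correlation is reported as `None` when one variable has no variance (as happens when every test runs to the maximum length).

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Product-moment correlation; None if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)


def evaluate(theta, theta_hat, k):
    """Summarize one simulated test administration."""
    errors = [est - true for est, true in zip(theta_hat, theta)]
    return {
        "bias": mean(errors),
        "mean_abs_error": mean([abs(e) for e in errors]),
        "r_theta_k": pearson_r(theta, k),
        "r_thetahat_k": pearson_r(theta_hat, k),
        "r_error_theta": pearson_r(errors, theta),
        "r_error_thetahat": pearson_r(errors, theta_hat),
        "fidelity": pearson_r(theta, theta_hat),
        "k_mean": mean(k),
        "k_min": min(k),
        "k_max": max(k),
    }


# Hypothetical three-examinee example, every test at the 30-item maximum.
summary = evaluate([-1.0, 0.0, 1.0], [-0.5, 0.1, 0.7], [30, 30, 30])
print(summary)
```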

Results 

Table 2 contains the results from Study 1. As Table 2 shows, there
was positive bias (.06 and .05) in the group scores for both tests,
indicating that ability was overestimated, on the average. Mean absolute
error was .26 for the $a = .80$ item pool and .19 for the more discriminating
item pool; in these data, then, the more discriminating item pool estimated
ability with smaller average error.



Table 2

Properties of the Bayesian Sequential Test for Two Values of Item
Discrimination, with Corrected Guessing and Ideal Item Pool

                                      Item Discrimination (a)
Property                                 .80        1.60

Test Length
   Mean                                  30*        18
   Minimum                               30         12
   Maximum                               30         30

Errors of Estimate
   Mean (Bias)                           .06        .05
   Mean Absolute Error                   .26        .19

Correlates
   Errors with ability                  -.35       -.40
   Errors with estimated ability        -.07       -.21
   Test length with ability              **         .84
   Test length with estimated ability    **         .85
   Fidelity coefficient                  .96        .98

*An arbitrary maximum test length of 30 items was imposed.
**There was no variance on test length in the a = .80 test.
However, $\theta$ and $\hat{\theta}$ correlated .81 and .84 with posterior
variance.



Mean test length for the $a = .80$ item pool was 30 items, with no
variance, indicating that the posterior variance termination criterion
never was reached using this item pool. The more discriminating pool
($a = 1.60$) required a mean test length of 18 items, with a range of from 12
to 30. For this item pool test length correlated .84 and .85 with ability
and the ability estimator, respectively. This strong positive correlation
was essentially the same as was found in the live-testing results. It
indicates that despite the "ideal" construction of the item pool, the test
required substantially larger numbers of items to achieve the termination
criterion as ability increased. (Since there was no variance in test
length for the $a = .80$ item pool, the test length correlations cannot be
evaluated under that item pool configuration.)

Errors of estimate $(\hat{\theta}_i - \theta_i)$ correlated -.35 and -.40
with ability for the two item pools, which could indicate a tendency to
underestimate ability at high levels and to overestimate it at low levels.
This, of course, is a phenomenon typical of regression estimates; the Bayesian
test scores seem to be acting like regression estimates in this regard.
This same tendency was evident to a smaller extent in the correlations
between errors and ability estimates ($r_{e\hat{\theta}}$).

The fidelity coefficients ($r_{\theta\hat{\theta}}$) were .96 and .98,
respectively, for the $a = .80$ and $a = 1.60$ item pools. These were slightly
higher than those obtained by Jensema (see Table 1). The differences are
likely due to random fluctuations resulting from the relatively small sample
size of 100 simulated testees (see Betz & Weiss, 1974, pp. 20-21 and 24-25).

Conclusions 

The replication of Jensema's study of the Bayesian sequential
test using these two item pools corroborated his findings with regard to
fidelity and mean test length. The fidelity coefficients obtained in the
present study were slightly higher than his, while mean test lengths
were almost identical. It seems clear that Owen's adaptive testing procedure
has the potential of achieving measurement of high fidelity with relatively
short tests. However, the strong correlation between ability and test
length suggests a potential problem if the Bayesian test is used in a
group of higher ability than is assumed beforehand. Additionally, the
overall positive bias of the trait estimates suggests that additional
study of the testing procedure is required before its scores are used
directly as estimators of ability. However, the generality of the results
of Study 1 is limited to "ideal" item pools with rectangular distributions
of the difficulty parameters and with the same discrimination and guessing
parameters as in the present study.



Study 2: Effects of Guessing and Item Discrimination 
in a Perfect Item Pool 

Background and Purpose

The discovery in Study 1 of positive bias in the Bayesian trait
estimates, and of a strong positive correlation between ability and test
length in the $a = 1.60$ item pool, raises the question of the generalizability
of these phenomena. These results might be due to sampling fluctuations,
to the specific item parameters employed, to the effects of random guessing,
or to characteristics inherent in Owen's sequential testing procedure.
Study 2 was designed to test the generality of the results of Study 1.

In Study 2 many sequential tests were simulated by varying the
discriminating power of the item pool and the effect of guessing.
Further, in order to avoid loss of generality due to a specific range
of the distribution of item difficulty values in the item pool, Study 2
simulated a "perfect" item pool, one behaving as though it contained an
unlimited number of items at any specifiable difficulty level. The results
of Study 2, therefore, should reflect the best attainable results under the
Bayesian procedure, given the guessing and discrimination parameters of
the items.

To evaluate the effects of guessing on testing strategy characteristics,
test administration was simulated under the three different guessing
conditions described below: no guessing, uncorrected guessing, and corrected
guessing. Under each of these conditions fourteen "perfect" item pools
were simulated. These differed from one another only in their item
discriminating powers. Thus, fourteen values of $a$ were used; $a$ was
constant within any test simulation, but varied across tests. The same
properties of the test procedure studied in Study 1 were of interest in
Study 2.

Method 

Variables. Dependent variables in Study 2 were the same as in Study
1: ability estimates ($\hat{\theta}$) and test length ($k$). Independent
variables were simulated ability ($\theta$), discriminating power of the item
pool, and the effect of guessing and of scoring for guessing.

To study the effect of guessing, three different conditions were simu- 
lated: 

1. No guessing; in the item response model, c was set to 0, 
and was assumed to be zero in the Bayesian scoring formulae 
(Equations 1 through 4). 

2. Uncorrected guessing; c was set to .20 in the item response
model, but was assumed to be zero in the Bayesian scoring
formulae.

3. Corrected guessing; c was set to .20 in both the item 
response model and the Bayesian scoring formulae. 

Under each guessing condition, fourteen test administrations were
simulated. These differed only in the constant value of the item
discriminating powers in the respective item pools. The fourteen values used
were $a$ = .5, .6, .7, .8, .9, 1.0, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
and 3.00. For each test administration, the same 100 simulated ability values
used in Study 1 constituted the examinee "group".
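The design just described amounts to a small grid of conditions. The sketch below lays it out as data; the names `c_response` and `c_scoring` are hypothetical labels for the guessing value used to generate responses and the value assumed by the scoring formulae.

```python
# Fourteen constant discrimination values, one per simulated item pool.
A_VALUES = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.25,
            1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00]

# Guessing value in the response model vs. the value assumed in scoring.
GUESSING_CONDITIONS = {
    "no guessing":          {"c_response": 0.00, "c_scoring": 0.00},
    "uncorrected guessing": {"c_response": 0.20, "c_scoring": 0.00},
    "corrected guessing":   {"c_response": 0.20, "c_scoring": 0.20},
}

# One entry per simulated test administration.
conditions = [
    {"condition": name, "a": a, **settings}
    for name, settings in GUESSING_CONDITIONS.items()
    for a in A_VALUES
]

print(len(conditions))
print(conditions[0])
```

Keeping the response-model and scoring values separate is what distinguishes the uncorrected condition, where examinees can guess but the scoring formulae ignore it.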
Item pools. The "perfect" item pools were simulated by calculating,
for each examinee after each item response was scored, the optimal
difficulty value of the next item, given $a_g$, $c_g$ and the current ability
estimate. This optimal item difficulty was determined using a formula
given by Birnbaum (1968, p. 464) for calculating the difficulty level at
which maximal item information occurs, given $a_g$ and $c_g$ and assuming that
$\hat{\theta}_i = \theta_i$. With $a_g$ constant and when no guessing is assumed
($c_g = 0$ in the scoring formula), the optimal item is one with
$b_{m+1} = \hat{\theta}_m$. When guessing is assumed, the optimal difficulty
($b_{m+1}$) is smaller than $\hat{\theta}_m$, by an amount which is inversely
proportional to $a_g$.

After the "optimal" item difficulty value was calculated, the
computer simulation program generated a hypothetical item with that
difficulty value, then "administered" it to the examinee. Thus, the
hypothetical item pool literally had available an unlimited number of
items of any difficulty value specified by the sequential testing
procedure.
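The difficulty calculation can be sketched as follows. The code uses the commonly cited form of Birnbaum's result for the three-parameter logistic model, in which information peaks at an ability above the item's difficulty by ln[(1 + sqrt(1 + 8c)) / 2] / (1.7a). The scaling constant 1.7 and the function name are assumptions; the original program may have differed in detail.

```python
import math

D_SCALE = 1.7   # assumed logistic-to-normal-ogive scaling constant


def optimal_difficulty(theta_hat, a, c):
    """Difficulty of the item that is most informative at theta_hat.

    With c = 0 the offset is zero, so the item difficulty equals the
    current estimate; with c > 0 the item is somewhat easier.
    """
    offset = math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D_SCALE * a)
    return theta_hat - offset


for a in (0.5, 1.0, 2.0):
    print(a, optimal_difficulty(0.0, a, 0.0), optimal_difficulty(0.0, a, 0.20))
```

Doubling the discrimination halves the offset, which matches the statement that the adjustment is inversely proportional to the discrimination parameter.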

Response generation and test administration. Item responses were
simulated in the same manner described in Study 1. Test administration
was identical with Study 1, except for the item difficulty generation
procedure. The same posterior variance criterion ($\sigma_m^2 < .0625$) was
used as a test termination rule. Unlike Study 1, test length was free to
exceed 30 items; a maximum length of 100 items was imposed. At test
termination, ability ($\theta_i$), the ability estimate ($\hat{\theta}_i$), and
the number of items administered ($k$) were recorded for each examinee.

Analysis. A total of 42 test administration conditions were
simulated: 14 "item pools" under each of the three guessing conditions. For
each test administration, the same sequential test properties estimated
in Study 1 were estimated: bias, mean absolute error, $r_{\theta k}$,
$r_{\hat{\theta} k}$, $r_{e\theta}$, $r_{e\hat{\theta}}$,
$r_{\theta\hat{\theta}}$, and the mean and range of test length.

Results 

No-guessing condition. As Table 3 shows, test length was constant
within item discrimination level under no-guessing, and decreased as
item discrimination increased. The posterior variance
termination criterion was reached for all examinees using every item pool
except the one having $a = .50$. As a point of comparison with Study 1, test
termination was achieved in fewer than 30 items for item pools having
$a \ge 1.00$. There was no correlation between test length ($k$) and $\theta$
or $\hat{\theta}$, since there was no variance in test length for any test
administration.

The overall bias of estimate under the no-guessing condition was
practically zero for all but the highly discriminating item pools (see
Table 3 and Figure 1). Mean absolute error was .17 for $a = .5$ and increased
fairly steadily to .22 for the $a = 3.00$ item pool. For the no-guessing
condition, then, there is a tendency for the highly discriminating item
pools to yield larger average errors than the moderately discriminating
item pools.



Figure 1

Bias and Mean Absolute Error as a Function of Item
Discriminations, for the Perfect Item Pool with
No Guessing
As in Study 1, errors of estimate $(\hat{\theta}_i - \theta_i)$ correlated
negatively with $\theta$ (-.27 to -.39) and with $\hat{\theta}$ (-.08 to -.20).
Again, these correlations suggest a regression effect.
The fidelity coefficients were all .97 or .98, as "predicted" by the
posterior variance termination criterion value. Interestingly, the lower
fidelity coefficients occurred at the higher item pool discrimination
values.
Uncorrected-guessing condition. As Table 4 shows, the test length data
were identical with those obtained under the no-guessing condition. Table 4
and Figure 2 show that both mean algebraic errors (bias) and absolute errors
were quite high (.57, .58) for the $a = .50$ item pool and decreased as $a$
increased, to about $a = 1.25$. For $a > 1.25$ the mean errors seemed to level
off, with moderately large values for both bias and absolute error.



Figure 2

Bias and Mean Absolute Error as a Function of Item
Discriminations, for the Perfect Item Pool with
Uncorrected Guessing
As before, errors of estimate correlated negatively with ability; the
magnitudes of the correlations were large for $a = .50$, then decreased as $a$
increased, until approaching a constant value at $a > 1.75$. Again, these
correlations suggest a regression effect. The correlations of errors with
ability estimates, $r_{e\hat{\theta}}$, followed a different trend under this
condition than was seen previously; $r_{e\hat{\theta}}$ was -.29 for $a = .50$,
then showed a steady algebraic increase with $a$, to a value of .07 at
$a = 2.75$.

Fidelity coefficient values were everywhere lower with uncorrected
guessing than with corrected guessing, and decreased steadily from .97 to
.91 as $a$ increased. As expected, fidelity increased with test length.



Figure 3

Number of Items to Termination, with .20 Guessing
Corrected-guessing condition. Figure 3 graphically depicts test
length as a function of item discriminatory power ($a$). The vertical bars
in Figure 3 indicate the range of test length at a given $a$-level; the dot
indicates the mean test length for that level. As Table 5 and Figure 3
show, some variance in test length was present for all $a$ levels except
$a = .50$ (where the termination criterion never was reached). Mean test length
to termination varied inversely with item discrimination, as in the other
conditions. Even with this perfect item pool, the termination criterion
was achieved in fewer than 30 items only for $a > 1.00$.

As Figure 4 shows, the bias of estimate was small but positive under
the corrected guessing condition, increasing to meaningful levels only as
item pool discrimination exceeded $a = 2.25$. Mean absolute error was almost
constant across levels of $a$.
As was seen in Study 1, test length correlated strongly with ability
(and ability estimates) where it was free to vary (Table 5). Since test
termination takes place only after a specified reduction of the posterior
variance has occurred, the large positive $r_{\theta k}$ correlations indicate
that the rate of posterior variance reduction is a function of ability level,
with more rapid reduction taking place as ability ($\theta$) decreases.



Figure 4

Bias and Mean Absolute Error as a Function of Item
Discrimination, for the Perfect Item Pool with
Corrected Guessing

[Plot of mean absolute error and bias (θ̂ - θ) against item discrimination (a).]



As seen under the other conditions, Table 5 shows that errors of
estimate correlated negatively (-.25 to -.42) with ability and with ability
estimates (-.09 to -.23). As in the no-guessing condition, all fidelity
coefficients were .97 or .98, with the lower value occurring at the higher
item discrimination levels.

Conclusions 

Study 2 supports the findings of Study 1 and extends them somewhat.
As in Study 1, the Bayesian testing strategy resulted in very high fidelity
coefficients with relatively short tests, provided the item discriminating
powers were 1.0 or greater. The Study 1 finding of positive overall bias
of estimate was corroborated here: only one of the forty-two bias estimates
was negative. Especially noteworthy was the effect of uncorrected guessing
on both the ability estimates and the fidelity coefficients: bias was
severe, and fidelity actually decreased as discriminating power increased.

Under the corrected-guessing condition, the finding of a strong
positive correlation between test length and θ or θ̂ was replicated
consistently. It is important to note that this result was obtained under
conditions of a "perfect" item pool; this implies that the high correlation
does not result from inadequacies of the item pool. Since there was no
variance in test length when no guessing was assumed (i.e., for the
no-guessing and uncorrected-guessing conditions), the phenomenon would seem
to be due to the scoring formulae in some way. The phenomenon by itself
is of little concern unless it results in different measurement properties
at different levels of ability. This may be the case; some of the
properties of the sequential test seem to improve with test length. If test
length is consistently greater as ability increases, then the test may be
measuring less well as ability decreases, due simply to the effects of test
length.



Study 3: Effects of Fixed Test Length



Background and Purpose 

The results of Study 2 make it obvious that with guessing a factor,
test length increases with ability level when the posterior variance
criterion is used to terminate testing. It was suggested that some
measurement properties of the test may suffer as a consequence. Two properties
which seem to be affected adversely by short test length are bias and mean
absolute error, both of which increased as item discrimination became very
high (and test length very short) in the no-guessing and corrected-guessing
conditions (see Tables 3 and 5). Another property which should be
adversely affected by very short test lengths is fidelity. Study 2 noted
a small but consistent decline in fidelity at the very high discrimination
levels (see Tables 3, 4 and 5). Additionally, Jensema (1972) noted a
similar phenomenon, which he termed "correlation drop-off."

This study explored the effect of administering the same number of items
to all examinees, on the same properties which were of interest in Studies
1 and 2. This was done by means of simulating fixed-length Bayesian tests
for the corrected-guessing condition, under various item discrimination
levels. To avoid loss of generality, the "perfect" item pool was again
employed.

Method 

Variables. Dependent variables were the ability estimates (θ̂) and the
posterior variance (σ²) after a fixed number (k) of items had been
administered. Independent variables were simulated ability (θ) and item
discriminating power. Nine levels of discriminating power were studied:
a = .6, .8, 1.0, 1.25, 1.50, 1.75, 2.0, 2.5, 3.0. Examinees were the same 100
simulated ability values (θ_i, i = 1, 2, ..., 100) used in Studies 1 and 2.

Item pools. "Perfect" item pools were simulated, as described in
Study 2; i.e., the locally optimum item difficulty was calculated after each
item response, and an item having that difficulty level was artificially
generated and administered.

Response generation and test administration. Item responses were
simulated in the same manner as in Studies 1 and 2. Test administration was
identical with Study 2, except that all "examinees" were administered 30
items. After 30 items, the individual ability (θ_i), the estimate (θ̂_i), and
the posterior variance (σ²_30) were recorded for each examinee.
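The fixed-length administration described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used in the study: it replaces Owen's closed-form normal approximation with a discrete grid posterior, assumes a standard normal prior, uses a normal-ogive response model with guessing, and sets each item's difficulty equal to the current estimate instead of computing the locally optimum difficulty. The names `p_correct` and `fixed_length_test` are hypothetical.

```python
import math
import random

C = 0.20  # guessing parameter of the corrected-guessing condition


def p_correct(theta, a, b, c=C):
    """Normal-ogive probability of a correct response, with guessing."""
    z = a * (theta - b)
    return c + (1.0 - c) * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fixed_length_test(theta_true, a, n_items=30, rng=None):
    """Return (estimate, posterior variance) after n_items adaptive items."""
    rng = rng or random.Random(0)
    grid = [-4.0 + 0.05 * i for i in range(161)]
    # Assumption: N(0, 1) prior, normalised over the grid.
    post = [math.exp(-0.5 * t * t) for t in grid]
    total = sum(post)
    post = [w / total for w in post]
    for _ in range(n_items):
        mean = sum(t * w for t, w in zip(grid, post))
        b = mean  # simplification: difficulty matched to the current estimate
        correct = rng.random() < p_correct(theta_true, a, b)
        like = [p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b)
                for t in grid]
        post = [w * l for w, l in zip(post, like)]
        total = sum(post)
        post = [w / total for w in post]
    mean = sum(t * w for t, w in zip(grid, post))
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post))
    return mean, var


estimate, post_var = fixed_length_test(theta_true=1.0, a=1.5)
print(estimate, post_var)
```

Recording `estimate` and `post_var` for each simulated examinee gives the three quantities the text says were stored after 30 items, alongside the known `theta_true`.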

Analysis. A total of nine test administrations were simulated (one at
each item discrimination level). For each administration these sequential
test properties were estimated as described in Study 1: bias, mean absolute
error, r_θe, r_θ̂e, and r_θθ̂. Additionally, for each administration, the
correlations of the posterior variance with θ and θ̂ were calculated.
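As a rough illustration of the group statistics listed above, the sketch below computes them from three parallel lists. It assumes error is defined as estimate minus ability, as in the Results; the function names and the five-examinee data are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarize(theta, theta_hat, post_var):
    """Group statistics for one simulated test administration."""
    errors = [est - true for est, true in zip(theta_hat, theta)]
    n = len(errors)
    return {
        "bias": sum(errors) / n,
        "mean_abs_error": sum(abs(e) for e in errors) / n,
        "r_theta_e": pearson(theta, errors),
        "r_thetahat_e": pearson(theta_hat, errors),
        "fidelity": pearson(theta, theta_hat),
        "r_theta_var": pearson(theta, post_var),
        "r_thetahat_var": pearson(theta_hat, post_var),
    }


# Made-up values: estimates shrunk toward zero, variance rising with ability.
theta = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta_hat = [-1.5, -0.8, 0.1, 0.9, 1.6]
post_var = [0.05, 0.06, 0.07, 0.08, 0.09]
stats = summarize(theta, theta_hat, post_var)
print(stats)
```

With estimates that are shrunk toward the middle, `r_theta_e` comes out negative, which is the pattern the text interprets as a regression effect.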
Results 

Table 6 and Figure 5 contain the results of Study 3. To facilitate
comparing the 30-item test length with the posterior variance termination
criterion, comparable data from Study 2 are included in Figure 5.

As Figure 5 shows, the overall bias of estimate was virtually zero in
all item pools, except for the a = .60 and a = 2.5 item pools. Mean absolute
error decreased steadily as a function of a, and was lower for fixed test
length than for the variable test length conditions for all discriminations
larger than a = 1.50. As in Studies 1 and 2, error (θ̂_i - θ_i) correlated
negatively with θ and θ̂, suggesting a regression effect.

As Table 6 shows, the posterior variance correlated positively with
θ and θ̂, with the magnitude of the correlation generally diminishing as
a increased (e.g., r_θσ² was .86 for a = .6, and .74 for a = 3.0). This trend
corresponds to the one seen in Studies 1 and 2: test length correlates
strongly with ability when posterior variance is held constant.

The fidelity coefficients increased with the item discriminating
power, from .93 at a = .60 to .99 at a = 1.5 and higher.



Table 6

Errors of Estimate and Correlates of the Bayesian Sequential Test Ability
Estimates as a Function of Item Discrimination, for 30-Item Test Length
and Corrected Guessing, with Perfect Item Pool

                                      Item Discrimination (a)
Property                     .6    .8   1.0  1.25   1.5  1.75   2.0   2.5  2.75

Errors of Estimate
  Mean (Bias)               .09   .01  -.01   .02  -.01   .00   .01   .04   .01
  Mean Absolute Error       .33   .28   .21   .17   .15   .12   .12   .12   .09

Correlates
 With Error
  r_θe                     -.41  -.30  -.36  -.34  -.40  -.32   .32  -.51  -.36
  r_θ̂e                     -.04   .01  -.13  -.15  -.24  -.19  -.18  -.36  -.23
 With Posterior Variance
  r_θσ²                     .86   .85   .89   .81   .82   .77   .69   .76   .74
  r_θ̂σ²                     .93   .90   .90   .84   .82   .79   .69   .72   .73

Fidelity
  r_θθ̂                      .93   .95   .97   .98   .99   .99   .99   .99   .99

Conclusions

It is apparent that some improvement in the properties of the
Bayesian testing procedure can be realized by setting test length constant,
provided that item discriminatory power is sufficiently high (e.g.,
greater than a = 1.5). Bias seems to be diminished, and absolute error
decreases as discrimination increases.



Study 4: Effects of Ability Level
and Item Pool Configuration

Background and Purpose

Simulation studies of Owen's Bayesian sequential test procedure
typically have concentrated their attention on group statistics. For
example, Urry (1971, 1974) and Jensema (1972, 1974) evaluate their results
in terms of fidelity coefficients and mean test length (using a posterior
variance termination criterion). Studies 1, 2, and 3 above have extended
Urry's and Jensema's work by examining additional properties of the
sequential testing procedure, but they also concentrate on group statistics.
With any group statistic, such as a fidelity coefficient, a bias estimate,
or a mean test length, there is a lack of invariance across groups. A
change in the shape of the distribution, or the central tendency and
variability, may alter the magnitude of the group statistic markedly. Therefore,
some distribution-free methods for evaluating the Bayesian sequential
adaptive test are needed. One general method for this is to examine
characteristics of the test as a function of ability level.

Given that some properties are to be evaluated as a function of
ability level, it is necessary to select the properties of interest. The
results of Studies 1, 2, and 3 suggest some characteristics of Owen's
procedure which bear further investigation. For instance, there was a
tendency in the preceding studies for positive bias to occur, i.e., for
the group average ability estimates to be larger than the average ability.
Additionally, there was consistently a moderate negative correlation
between ability and the errors of estimate, indicating a regression effect.
The negative correlation between the estimates themselves and their error
further suggests that the regression may be non-linear. The strong positive
correlation between test length and ability indicates that the posterior
variance estimate is being reduced more rapidly at low ability levels than
at high ones, despite the use of the "perfect" item pools and the presence
of constant item discrimination across all difficulty levels.

Based on the findings of Studies 1, 2, and 3, the present study
examined appropriate properties of the Bayesian sequential testing
strategy as a function of ability level. These properties include the form of
the regression of ability estimates on θ, the conditional bias of the
ability estimates, and mean test length. In addition, this study included
estimation of the "information" (Birnbaum, 1968) in the Bayesian test
ability estimates at various levels of ability.



In addition to estimating the regression, bias and information in the
Bayesian test scores as a function of ability, this study examined the
effect which different item pool "configurations" might have on these
properties. Item pool configuration here refers to the regression of item
discrimination (a) values on the item difficulty (b) values in the item
pool. Studies 1, 2, and 3 above, and all previous research using "ideal"
item pools, have simulated item pools in which a was constant across items
or in which a was statistically independent of b. The absence of any
statistical association between a and b implies that the same item
information (Birnbaum, 1968, p. 449) is available at all levels of item
difficulty. On the other hand, if there is a statistical relationship
between the discrimination and difficulty values of the items in a given
item pool, there will be more information available in some ranges of the
ability continuum than there is in others.

Although in theory it is desirable for adaptive testing to assemble
an item pool having equally discriminating items at all the difficulty
levels represented, in practice this has not always been achieved. For
instance, the 58-item pool used by Jensema (1972) to simulate adaptive
testing based on some items from the Washington Pre-College examinations
had very highly discriminating items in its upper difficulty ranges and
low-to-moderately discriminating items in the easy range of difficulty.
Similarly, Lord (1974) reported that the discrimination parameters of his
item pool correlated positively with the difficulty parameters. Practical
implementations of adaptive testing are likely to use item pools in which
the configuration of the item parameters is less than ideal. Therefore,
the effects of different item pool configurations on the psychometric
characteristics of the test scores (or trait estimates) need to be
investigated.

This study investigated three different configurations of the item
pools. Each configuration was characterized by a different slope of the
regression of item discrimination parameters on item difficulty, which in
turn can be characterized approximately in terms of the correlation, r_ab,
between item discriminating power and difficulty. Identical test simulation
studies were conducted under all three configurations in order to evaluate
any differential effects.

Method 

Variables. Dependent variables were the ability estimates (θ̂) and the
number of items (k) required to satisfy the test termination criterion.
Independent variables were the simulated examinees' abilities (θ_i) and the
configuration of the simulated item pool. Examinees' abilities for each
test administration were simulated by 3100 values of θ_i, 100 at each of 31
equally spaced levels in the interval [-3.0 ≤ θ ≤ +3.0]. This examinee
distribution was used because of the need for relatively large numbers of
observations at each level of θ in order to estimate accurately the regression
of ability estimates on ability, the conditional bias, and the information
curves.
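The examinee design just described amounts to a grid with a spacing of 0.2 between adjacent ability levels. A minimal sketch of how such a set of 3100 simulated abilities could be laid out (variable names are illustrative only):

```python
N_LEVELS = 31       # equally spaced ability levels
N_PER_LEVEL = 100   # simulated examinees at each level

# 31 levels from -3.0 to +3.0 imply a step of 6.0 / 30 = 0.2.
levels = [round(-3.0 + 0.2 * i, 1) for i in range(N_LEVELS)]

# One entry per simulated examinee, grouped by ability level.
examinees = [theta for theta in levels for _ in range(N_PER_LEVEL)]

print(len(levels), len(examinees), levels[0], levels[-1])
```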
Item pools. Three "perfect" item pools were simulated, one for each
configuration. The three configurations studied included one with a
moderate positive correlation of a with b (referred to hereinafter as
r_ab+), one with a moderate negative correlation (r_ab-), and one with no
correlation (r_ab0). The r_ab+ configuration favored the more difficult
items with higher discriminating powers, the r_ab- configuration favored the
easier items, and the r_ab0 configuration favored no difficulty levels.

As in Studies 2 and 3, after each item response the optimal difficulty
of the next item to administer was calculated, and an item having that
difficulty value was artificially generated and administered. In the
previous studies, the optimal difficulty calculation was based on the
guessing parameter (c) and on the constant discrimination parameter (a) of
the items in the pool. In this study, the same calculation was based on
the mean item discrimination parameter (ā), which was 1.25 for all
configurations. In all cases, c was .20.

The item pool configuration was simulated by:

1. Selecting the appropriate b_g for the next item from the
   perfect item pool as though all a_g were equal to ā; call
   this b*_g.

2. Calculating a conditional value, a_g|b*_g, from a linear transform
   of b*_g involving r_ab, S.D._a, and S.D._b,
   where S.D._a is the standard deviation of the a parameters
   in the simulated pool;
   S.D._b is the standard deviation of the b parameters in the
   simulated pool;
   ā, b*_g, and r_ab are as previously defined.

3. Adding an error component, e_g, to the approximate a_g, so that
   for each item administered a*_g = a_g|b*_g + e_g,
   where a*_g is the simulated discriminating power of the item;
   a_g|b*_g is the approximate discrimination defined above;
   e_g is a random number from a normal (0, σ²_e) population.

4. Setting a*_g equal to .80 whenever it would otherwise have a
   lower value.
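Steps 2 through 4 can be sketched as below. The exact linear transform and the error variance are not legible in this copy, so the sketch assumes the ordinary regression of a on b with mean difficulty zero, and a residual standard deviation of S.D._a times the square root of (1 - r_ab²). Both are assumptions, as are the function name and the numerical values of `r_ab`, `sd_a`, and `sd_b` used in the example; only the mean discrimination of 1.25 and the floor of .80 come from the text.

```python
import random

A_BAR = 1.25    # mean item discrimination (from the text)
A_FLOOR = 0.80  # lowest simulated discrimination allowed (from the text)


def simulate_discrimination(b_star, r_ab, sd_a, sd_b, rng):
    """Draw a discrimination value for an item of difficulty b_star."""
    # Step 2 (assumed form): regression of a on b, mean difficulty taken as 0.
    a_given_b = A_BAR + r_ab * (sd_a / sd_b) * b_star
    # Step 3 (assumed error variance): residual variance of that regression.
    sd_e = sd_a * (1.0 - r_ab ** 2) ** 0.5
    a_star = a_given_b + rng.gauss(0.0, sd_e)
    # Step 4: never let the discrimination fall below .80.
    return max(a_star, A_FLOOR)


rng = random.Random(0)
# Hypothetical positive configuration: harder items tend to discriminate more.
for b in (-2.0, 0.0, 2.0):
    print(b, simulate_discrimination(b, r_ab=0.5, sd_a=0.3, sd_b=1.0, rng=rng))
```

Setting `r_ab` to a negative value or to zero in this sketch would correspond to the other two configurations.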

Response generation and test administration. Item responses were
simulated in the same manner described in Study 1. Test administration was
identical with Study 1. A posterior variance termination criterion of
σ² = .0625 was used, with an arbitrary maximum test length of 30 items. The
corrected-guessing condition was used. At termination, the ability (θ_i),
its estimate (θ̂_i), and the number of items administered (k) were recorded
for each examinee.

Analysis. For each of the three simulated test administrations, the
following properties of the sequential test were estimated from the 100
observations at each separate ability level (θ_i):

a. the conditional mean,
   \bar{\hat\theta}_i = \frac{1}{100} \sum_{j=1}^{100} \hat\theta_{ij}   [11]
b. the conditional variance,
   s^2_{\hat\theta|\theta_i} = \frac{1}{100} \sum_{j=1}^{100} (\hat\theta_{ij} - \bar{\hat\theta}_i)^2   [12]
c. the conditional bias,
   \bar{e}_i = \bar{\hat\theta}_i - \theta_i   [13]
d. the conditional mean test length, \bar{k}|\theta_i.

The regression of the trait estimates (θ̂) on ability (θ) was estimated
by fitting a third degree polynomial to the 31 conditional means, using a
least squares method. The regressions of bias and test length on θ were
estimated graphically.
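The conditional statistics listed above could be computed per ability level along the following lines. This sketch assumes the variance uses a divisor equal to the number of observations at the level; the function name and the four-examinee example are hypothetical.

```python
def conditional_stats(theta, estimates, lengths):
    """Conditional mean, variance, bias, and mean test length at one level.

    theta     -- the true ability shared by all examinees at this level
    estimates -- the ability estimates obtained at this level
    lengths   -- the number of items administered to each examinee
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Assumption: divisor n (100 in the study), not n - 1.
    var_est = sum((x - mean_est) ** 2 for x in estimates) / n
    return {
        "mean": mean_est,
        "variance": var_est,
        "bias": mean_est - theta,
        "mean_length": sum(lengths) / n,
    }


# Made-up data for a single low ability level.
row = conditional_stats(-3.0, [-2.5, -2.3, -2.7, -2.5], [30, 28, 30, 30])
print(row)
```

Applying the function to each of the 31 levels in turn would give the 31 conditional means to which the third degree polynomial is fitted.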



The information in a set of test scores (x) can be defined as

   I_x(\theta) = \left[ \frac{\frac{\partial}{\partial\theta} E(x|\theta)}{\sigma_{x|\theta}} \right]^2   [14]

The "information" value of test scores at any level of ability is an index
of the usefulness of those scores for discriminating among examinees in the
vicinity of that level. A zero information value indicates that the test
scores are useless for making discriminations about a given point; an
infinite information value indicates that error-free discriminations can be
made about that point on the basis of the test scores. Any value between
the two extremes has implications for the probability of making Type I and
Type II errors in classifying persons above or below the point in question.

The numerator in Equation 14 is the first partial derivative of the
function describing the regression of test scores (x) on the trait (θ).
The denominator in Equation 14 is the conditional standard deviation of the
scores. The regression of test scores on θ can be approximated from
empirical data, if the scores (x) and the latent trait values (θ) are known.

Since the Bayesian trait estimates (θ̂) can be treated as test scores,
the numerator of the information function can be evaluated at any point (θ')
from the slope of the equation for the regression of θ̂ on θ. That equation
was calculated from the simulation data as described above. In estimating
the information curves, the first partial derivative (i.e., the slope) of
that polynomial equation was evaluated at each of the 31 θ points used in
the study. The denominator of the information function at each of the
same 31 points was estimated by the square root of the conditional variance
of the trait estimates at that point.



Figure 6

Mean Estimated Ability (θ̂) at 31 Ability Points (θ)
for the Simulated Bayesian Sequential Test under
Three Item Pool Configurations

Thus for each of 31 points θ', the information at that point, I_θ̂(θ'),
was estimated from the test simulation data, as

   \hat{I}_{\hat\theta}(\theta') = \left[ \frac{\frac{\partial}{\partial\theta} E(\hat\theta|\theta')}{\sigma(\hat\theta|\theta')} \right]^2   [15]

where E(θ̂|θ') is the third degree polynomial regression fitted to
the 31 test score means;

σ(θ̂|θ') is the square root of the observed variance of the
100 test scores at θ'.
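The estimation of the information curve can be sketched as follows: fit a cubic to the conditional means by least squares, differentiate it, and divide the slope at each ability point by the conditional standard deviation before squaring. The helper names, the plain normal-equations solver, and the synthetic input values are all illustrative assumptions, not the original computing procedure.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via normal equations."""
    m = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = sum(A[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (b[i] - s) / A[i][i]
    return coef


def slope(coef, x):
    """First derivative of the fitted polynomial at x."""
    return sum(k * c * x ** (k - 1) for k, c in enumerate(coef) if k > 0)


def information(levels, cond_means, cond_sds):
    """Squared ratio of regression slope to conditional SD at each level."""
    coef = polyfit(levels, cond_means, 3)
    return [(slope(coef, t) / sd) ** 2 for t, sd in zip(levels, cond_sds)]


# Synthetic example: a regression that flattens toward the extremes.
levels = [-3.0 + 0.2 * i for i in range(31)]
cond_means = [0.9 * t - 0.01 * t ** 3 for t in levels]
cond_sds = [0.25] * 31
info = information(levels, cond_means, cond_sds)
print(info[0], info[15], info[30])
```

In this synthetic case the conditional standard deviation is constant, so the estimated information falls off toward the extremes purely because the regression slope becomes shallower there.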



Results 



Regression of θ̂ on θ. Figure 6 is a plot of the observed mean ability
estimates (θ̂) as a function of actual trait level (θ) differentiated by item
pool configuration; Appendix Table A-1 shows the numerical values of these
means. For each configuration, then, Figure 6 contains the graphic empirical
approximation of the regression of θ̂ on θ. The values for each item pool
configuration form an essentially linear plot for levels of θ between +1 and
-1, with a tendency toward departure from linearity for values of θ larger
than +1 and smaller than -1. High abilities are underestimated; low abilities
are overestimated. The exaggeration of this effect seems strongest for the
r_ab- configuration, in which the average item discrimination increased as the
ability estimates decreased.

Figure 7

Mean Error of Estimate (θ̂ - θ) at 31 Ability Points (θ)
for the Simulated Bayesian Sequential Test under
Three Item Pool Configurations

Bias. Figure 7 contains the plot of conditional bias (mean (θ̂ - θ)) on
ability (numerical values are in Appendix Table A-1 as e). For each
configuration, the curve described by these data is non-linear. As Figure 6
showed indirectly, the conditional bias for all three configurations was
close to zero for -1 ≤ θ ≤ 1, but it increased with increases in absolute values
of θ elsewhere. A strong tendency to underestimate high θ was present in
all three configurations, and was severe for r_ab-, for which the bias was
-.43 at θ = 3.0. The tendency to overestimate low θ was even more pronounced,
and was severe for all three item pool configurations. For the r_ab0
configuration the conditional bias at θ = -3 was .53; for r_ab- the bias at
the same point was .61. If the θ metric is expressed in population standard
deviation units, then, the Bayesian sequential test estimates may typically
err by one-half standard deviation unit at low extremes of the ability range
and by a lesser but still significant amount at the high extremes.
Furthermore, this tendency is systematically affected by the configuration of the
item pool.



Figure 8

Mean Number of Items to Termination (Test Length) at 31
Ability Points (θ) for the Simulated Bayesian Sequential
Test under Three Item Pool Configurations

Figure 8 contains plots of mean test length as a function of ability
level for each item pool configuration (numerical values are in the
Appendix). For the r_ab0 configuration, test length was constant at 30
items, the arbitrary maximum. For r_ab+, where the most discriminating items
were available at the higher difficulty levels, test length was constant at
30 items for θ levels less than .6, then declined gradually to a mean of 23
items at θ = 3. The r_ab- configuration, which had higher item discrimination
at the lower difficulty levels, showed a trend opposite that for r_ab+. For
r_ab-, test length increased rapidly with θ from a mean of 14 items at
θ = -3, to 30 items at θ = 0; for all θ greater than zero, the test length was
30 items, the arbitrary maximum.

Figure 8 illustrates two interesting trends. First, not only did the
r_ab- configuration use fewer items than the others, but the rate of increase
as θ increased is noticeably steeper than the rate of decline in test length
for r_ab+. Second, for r_ab+, which required the fewest items at high θ
levels, bias (see Figure 7) was least pronounced at high θ levels; yet for
r_ab-, which required fewest items at low θ levels, there is no apparent
advantage at those levels in terms of bias.

Figure 9

Smoothed Information Curves for the Bayesian Sequential
Test under Three Different Item Pool Configurations

Information. Figure 9 contains smoothed information curves for the
three item pool configurations. (Numerical values of the estimated slopes,
conditional standard deviations, and information values at each of the 31
θ levels are shown in Appendix Table A-2.) For the r_ab0 configuration the
information curve shown in Figure 9 is convex, reaching its maximum height
very near θ = 0; the curve slopes gradually downward as θ increases above 0,
and more rapidly downward as θ decreases from 0. At θ = -3 the information
curve is quite low, indicating that despite the availability of test items
at all difficulty levels, the test scores will discriminate very poorly in
the low ability ranges.

For the r_ab+ configuration the information value at θ = -3 is even
lower, but it increases steadily, almost linearly, with θ. The r_ab+
information curve surpasses that of r_ab0 at θ > +1, as expected from the
availability of more discriminating items in the higher difficulty ranges.
For the r_ab- configuration, which had its lowest item discriminations in
the higher difficulty ranges, the information curve is quite low at high
ability levels, and it increases steadily as θ decreases, to about θ = 0.
Surprisingly, the information curve thereafter decreases with θ, reaching
its lowest point at θ = -3. This is a striking result in view of the
availability of more discriminating items at low θ levels for the r_ab- item pool.
It can be partly, but not entirely, accounted for by the shorter test lengths
seen for the r_ab- configuration at the low ability levels.



General Summary and Conclusions 

Previous research (e.g., Urry, 1971, 1974; Jensema, 1972) has shown
that Owen's Bayesian sequential approach to adaptive testing has the
potential of achieving very high correlations between ability level and
ability estimate concomitant with a significant savings in test length,
compared to conventional testing procedures. In order for this potential
to be realized, a relatively large item pool was required, with highly
discriminating items (a > .80) rectangularly distributed on the difficulty
continuum (Urry, 1974). Study 1 corroborated the findings of Urry and
Jensema in terms of test length and values of the fidelity coefficients.
At the same time Study 1 revealed an overall tendency for the Bayesian
trait estimators to overestimate group mean ability level. Also, the
results of Study 1 corroborated the finding in live-testing that with
Owen's strategy test length covaries positively with ability level.

The results of Study 1 were not definitive, partly because finite
item pools were employed. Study 2 overcame the specificity of Study 1 by
introducing the use of a "perfect" (or infinite) item pool, having
unlimited numbers of independent items at any difficulty level. At the same
time, Study 2 varied the values of the guessing parameter.

The results of Study 2 suggest that the bias problem seen in Study 1
may be largely a result of guessing; under the no-guessing condition bias
was virtually zero, except for the very highly discriminating item pools.
This relationship was confounded with test length, however, since the
highly discriminating item pools reached the test termination criterion in
a very small number of items (e.g., 5 items at a = 3.00). Under the
corrected-guessing condition, bias was consistently positive, and increased
as item discriminations increased and mean test length became very short.
Under the uncorrected-guessing condition, both bias and mean absolute
error were pronounced.

The high correlation between test length and ability level was
consistently present in Study 2 under the corrected-guessing condition. Under
no-guessing and uncorrected-guessing, however, there was no such
correlation because there was no variance in test length within a test. Under
the latter conditions, test length varied only across tests, i.e., as a
function of item discriminating power.

In terms of fidelity coefficients, there was no appreciable difference
between those obtained under no-guessing and under corrected-guessing,
given the common termination criterion. Under uncorrected-guessing,
however, there was some loss of fidelity as test length decreased. It
should be noted that the uncorrected-guessing condition was tantamount to
assuming an inappropriate item response model. The result of using the
inappropriate model to estimate ability and to select items sequentially
was to introduce large errors of estimate and some loss of fidelity.

The observation that bias, absolute error, and fidelity seemed to be
adversely affected by the short test lengths typical of highly
discriminating item pools led to using a fixed 30-item test length in Study 3.
The results confirmed the hypothesis that some undesirable psychometric
properties may accompany the use of very highly discriminating item pools
if the posterior variance criterion is used to terminate testing. When
test length remained constant, bias was virtually zero and absolute error
diminished steadily as item discrimination increased.

The interrelationships of test length, item discrimination, bias, and
absolute error would be a fruitful avenue for further research. If the
interdependencies were understood it would be possible for a test user to
control error magnitudes by appropriate choice of test length, given
knowledge of the parameters of the items in the item pool.

Study 4 investigated some of the characteristics studied earlier but
as a function of trait level. The curvilinear regression of the latent
trait estimators on trait level illustrates the conservative nature of Bayes
estimators. Fairly accurate estimation is achieved in the vicinity of the
assumed prior mean, at the expense of accuracy in the extremes. In a
sense, the Bayesian procedure gives little "credence" to extreme trait
values; this conservatism results in a consistent tendency to underestimate
high trait level values and to overestimate low ones. With guessing present
the overestimation problem becomes accentuated. This alone may be
sufficient to explain the positive bias seen in Studies 1 and 2: the
overestimates tend to be of larger magnitude than the underestimates, resulting in
an overall tendency towards overestimation.
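The direction of this regression effect can be illustrated with the simplest textbook case, a normal prior combined with a single normally distributed observation. This is not Owen's updating procedure, which uses item responses and a guessing parameter; the symbols \mu_0, \sigma_0^2, \sigma_e^2, x, and \rho below are introduced only for this illustration.

```latex
% Prior: \theta \sim N(\mu_0, \sigma_0^2); observation: x \mid \theta \sim N(\theta, \sigma_e^2).
\hat\theta = E(\theta \mid x)
           = \mu_0 + \rho\,(x - \mu_0),
\qquad
\rho = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_e^2} < 1 .

% Averaging over x at a fixed true \theta, with E(x \mid \theta) = \theta:
E(\hat\theta \mid \theta) - \theta = -(1 - \rho)\,(\theta - \mu_0).

% Example: \mu_0 = 0,\ \sigma_0^2 = 1,\ \sigma_e^2 = 0.1 \Rightarrow \rho \approx 0.909,
% so the conditional bias at \theta = 3 is about -0.27 and at \theta = -3 about +0.27.
```

In this simplified case the bias is linear in θ and symmetric about the prior mean; the non-linear and asymmetric bias reported for Study 4 therefore reflects features the sketch omits, such as guessing and the item selection rule.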

More significant than the direction of the conditional bias is its
form. Under all three item pool configurations in Study 4, the bias curves
were non-linear. In ability testing, bias is not usually of concern as
long as it is constant or linear in the parameter being estimated (Lord,
1970, p. 153), since these two cases imply a linear relationship between
test scores and trait level parameters. Non-linear bias, on the other hand,
implies a non-linear relationship, which in turn adversely affects the
utility of the test scores. Other things being equal (e.g., the conditional
variances of the test scores), if the regression of test scores on trait
level is non-linear, the scores will make better discriminations at some
trait levels than at others.

That this is the case with the scores resulting from Bayesian test
administration is evident in the information curves estimated from the data.
Although adaptive testing has the potential to result in equi-discriminating
ability estimates, the Bayesian sequential adaptive test has failed to
achieve this goal under the conditions simulated in Study 4. Under each
item pool configuration, some regions of the ability continuum had
considerably higher levels of information than others. Even under the
r_ab- configuration, where the best discriminating items were available in
the lowest difficulty regions, the information curve was very low in the low
ability region.

Lord (1970, p. 152) indicated that evaluating an adaptive test by means
of a group statistic (such as the fidelity coefficient, r_θθ̂) presumes some
knowledge of the group's distribution on the trait being measured, and
ignores information relevant to the accuracy of trait estimates at any one
level of the trait. The validity of the Bayesian sequential test trait
estimates, as the results show, was quite high under the conditions used in
these simulation studies. The accuracy of the estimates was also favorable
in what corresponds to the middle ranges of a normal distribution on θ, but
was found to be less favorable in the extremes, especially the lower extreme.
Similarly, the information curves of the trait estimates showed that the
effectiveness of measurement under the Bayesian testing procedure varied
systematically as a function of the configuration of the item parameters
constituting the item pool, but in all three configurations measurement
effectiveness was very low in the low ranges of the trait.

The observed loss of accuracy and information in the extremes of the
"typical" range of θ is disturbing, since a major advantage of adaptive
testing over conventional testing is the former's supposed potential for
superior measurement accuracy and effectiveness in those extremes. The data
of this series of studies show that with the exception of the r_ab+
configuration, the adaptive test scores behave much like conventional test scores,
at least in terms of the shapes of their information curves. The utility of
the Bayesian adaptive testing strategy may be diminished by results like
those reported for Study 4, if they prove to be general.

The problems of bias which is nonlinear in θ, and of convex
information curves as observed in Study 4, have causes which may be amenable
to improvement. Central to both problems is the effect of guessing, which
generally operates to reduce measurement efficiency at all trait levels,
and especially at low trait levels. Also at the core of the problems
is the Bayesian procedure itself. As was pointed out earlier, the Bayesian
trait estimates behave like regression estimates. Estimates for extreme
values of θ are systematically regressed toward the initial prior estimate;
the assumption of a normal prior distribution of θ ensures this tendency.
On the average, the more extreme θ is for any individual, the larger
will be the regression effect. Recall that the item selection procedure
selects an item with difficulty somewhat easier than the current θ
estimate. But for high θ the current estimate is almost always too low.
Hence the selected item will almost always be too easy
for extremely able examinees. Cumulated over 30 items, for example,
there will be several effects of this inappropriate item selection:

1. Mean proportion correct will tend to increase as a function
of θ, despite the implicit attempt of the tailoring procedure
to make it constant at all levels of θ;

2. θ will tend to be underestimated for high θ due to the
inappropriate difficulty of the test items administered;

3. Information loss will occur at high θ due to the shallowing
slope of the regression of θ̂ on θ.
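
The third effect can be made concrete with a small numerical sketch. It assumes that the information in the trait estimates at a given θ is approximated by the squared slope of the regression of θ̂ on θ divided by the conditional variance of θ̂, the quantities named in the title of Table A-2. The function name `score_information` and all numerical values below are hypothetical and are not taken from the study.

```python
def score_information(thetas, cond_means, cond_sds):
    """Approximate information of the estimates at each trait level as
    (slope of the regression of the estimate on theta)^2 / (conditional SD)^2.
    The slope is a finite difference over neighbouring trait levels."""
    info = []
    n = len(thetas)
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        slope = (cond_means[hi] - cond_means[lo]) / (thetas[hi] - thetas[lo])
        info.append(slope ** 2 / cond_sds[i] ** 2)
    return info

# Hypothetical values: the regression of the estimate on theta flattens
# as theta increases, while the conditional SD stays constant.
thetas = [0.0, 1.0, 2.0, 3.0]
cond_means = [0.0, 0.9, 1.6, 2.0]
cond_sds = [0.3, 0.3, 0.3, 0.3]

info = score_information(thetas, cond_means, cond_sds)
for theta, value in zip(thetas, info):
    print(f"theta = {theta:.1f}  approximate information = {value:.2f}")
```

With the conditional standard deviation held constant in this illustration, the flattening regression alone drives the approximate information downward as θ increases.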

For low θ the initial prior is an overestimate. Hence the first
item selected will generally be too difficult, yet the examinee has a
chance of answering it correctly by guessing. A correct answer, of course,
will cause an increase in the θ estimate and thus result in another
inappropriate choice of item difficulty. Furthermore, as Samejima (1973) has
shown, when guessing is a factor there may actually be negative information
in a correct response to an item whose difficulty exceeds an examinee's
actual trait level by a fairly small increment. Thus it appears that in
Owen's Bayesian strategy, testees in the low extremes of θ are rather
consistently being administered overly difficult items, with several
systematic results:

1. Mean proportion correct tends to decrease with θ despite the
tailoring process;

2. Posterior variance reduction tends to be more rapid for individuals
of low trait levels, due largely to their sub-optimal proportion
of correct responses, resulting in shorter mean test length;

3. The shorter the test length, the less opportunity the Bayesian
estimation procedure has to converge to extreme trait level
estimates;

4. Non-convergence combines with negative information in some correct
responses to diminish severely the effectiveness of measurement in
the low regions of the trait.
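
The negative-information effect cited from Samejima (1973) can be illustrated with a short sketch. It assumes a three-parameter logistic response model, chosen here for simplicity and not asserted to be the response model used in these studies; the item parameters are hypothetical. The information in a correct response is taken as the negative second derivative, with respect to θ, of the log probability of a correct response, approximated numerically.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def correct_response_information(theta, a, b, c, h=1e-3):
    """Negative second derivative of log P(correct) with respect to theta,
    approximated by a central finite difference with step h."""
    def log_p(t):
        return math.log(p_correct(t, a, b, c))
    return -(log_p(theta + h) - 2.0 * log_p(theta) + log_p(theta - h)) / (h * h)

# Hypothetical item: discrimination 1.0, difficulty 0.0, guessing 0.2.
a, b, c = 1.0, 0.0, 0.2
info_matched = correct_response_information(0.0, a, b, c)      # item matched to theta
info_too_hard = correct_response_information(-2.0, a, b, c)    # item 2 units too hard
info_no_guess = correct_response_information(-2.0, a, b, 0.0)  # same item, no guessing

print(info_matched, info_too_hard, info_no_guess)
```

Under these assumptions the quantity is positive when the item is matched to the examinee and when guessing is absent, but negative when a guessable item lies well above the examinee's trait level.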

Some of the conclusions just stated are speculative. Specifically,
neither proportion correct as a function of θ nor the differences (b - θ)
were examined in this study. Both of these reflect the effectiveness of
the tailoring process. McBride (1975), however, reported data which
showed proportion correct to be monotonically related to θ in another
simulation study of Owen's Bayesian strategy.

One goal of adaptive testing should be to achieve a constant high
level of measurement effectiveness at all levels of θ. This objective
is equivalent to a high, horizontal information function. The Study 4
results show that the Bayesian sequential testing strategy failed to
achieve this goal despite an unrealistically favorable set of
circumstances: a perfect item pool, error-free item parameters, and a scoring
model perfectly congruent with the item response model. The shortcomings
of the Bayesian trait estimate were attributed to the regression-like
tendency of the sequential estimates themselves, which in turn results in
inappropriate item selection for individuals whose trait levels are
relatively high or low.

There are at least two methods of ameliorating this problem, both
of which to some extent should lessen the bias of estimate at the extremes
and improve the information properties of the trait estimates. The first
method involves the assumption of a rectangular rather than a normal prior
distribution of θ. The second method would involve replacing the Bayesian
item selection procedure with a mechanical (e.g., non-mathematical)
branching procedure, which would be less sensitive to large errors in the
current trait estimate in its choice of the next item to administer.
Needless to say, both of these alternatives involve a considerable
departure from Owen's elegant procedure.
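
The first of these alternatives can be illustrated with a minimal sketch. It is not Owen's sequential procedure: it assumes a fixed set of hypothetical items, a three-parameter logistic response model, and a posterior mean computed on a grid of θ values, and it compares a standard normal prior with a rectangular prior over the same range. All names and values are illustrative assumptions.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def posterior_mean(responses, items, prior, grid):
    """Posterior mean of theta on a grid, for scored responses (1 = correct)."""
    weights = []
    for theta in grid:
        likelihood = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = p_correct(theta, a, b, c)
            likelihood *= p if u == 1 else 1.0 - p
        weights.append(prior(theta) * likelihood)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

def normal_prior(theta):
    """Unnormalized standard normal density."""
    return math.exp(-0.5 * theta * theta)

def rectangular_prior(theta):
    """Constant density over the grid range."""
    return 1.0

grid = [-4.0 + 0.05 * i for i in range(161)]     # theta from -4 to +4
items = [(1.5, 0.5 * k, 0.2) for k in range(6)]  # difficulties 0.0 to 2.5
responses = [1] * 6                              # an able examinee: all correct

est_normal = posterior_mean(responses, items, normal_prior, grid)
est_flat = posterior_mean(responses, items, rectangular_prior, grid)
print(est_normal, est_flat)
```

For this all-correct response pattern the estimate under the normal prior is pulled toward the prior mean of zero, while the estimate under the rectangular prior is not. The rectangular-prior estimate does depend on the range chosen for the grid.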

Implications. In testing persons of any given ability level, an
ideal adaptive testing strategy would select for administration the most
informative items available at that level. If the item pool were adequate,
the result would be that mean proportion correct would be approximately
constant across ability levels, and the information curve of the ability
estimates would be very high and almost flat. Such an adaptive test would
make equally good discriminations at any level of the ability trait. It
would also have approximately equivalent utility at any level at which
discriminations were to be made. It is apparent from the foregoing
discussion, especially from the data of Study 4, that the properties of
the Bayesian sequential adaptive test fall somewhat short of this ideal.
The research reported here has shown that the Bayesian procedure results
in very high correlations of ability level and test scores but also results
in ability estimates which are strongly biased in the extremes and which
are maximally informative only in the middle region of ability. If a test
user were concerned primarily with ordering examinees as to ability level,
the Bayesian sequential adaptive procedure would seem quite satisfactory.
However, the tendency of the Bayesian procedure to yield accurate measurement
in the vicinity of the prior mean at the expense of relatively inferior
measurement elsewhere may mandate selecting an alternative adaptive
strategy if the test user requires either equi-discriminating measurement
over a wide ability range or accurate ability estimation for ability levels
not near the mean. Simulation research by Vale & Weiss (1975) on
the stradaptive ability test (Weiss, 1973) shows that this adaptive testing
strategy provides measurement with the desired characteristics. Other
promising strategies for adaptive testing have been proposed by Lord
(1975) and Samejima (1975).
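
The idea of selecting the most informative item available at a given ability level can be sketched as follows. The sketch assumes a three-parameter logistic model and the usual item information function for that model. The small item pool and the function names are hypothetical, and the rule shown is a generic maximum-information rule, not the item selection rule of Owen's procedure or of any strategy cited above.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information: a^2 * (Q / P) * ((P - c) / (1 - c))^2."""
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def most_informative_item(theta, pool):
    """Index of the pool item with the greatest information at theta."""
    return max(range(len(pool)), key=lambda i: item_information(theta, *pool[i]))

# Hypothetical pool of (a, b, c) triples differing only in difficulty.
pool = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]

chosen_low = most_informative_item(-2.0, pool)
chosen_mid = most_informative_item(0.0, pool)
chosen_high = most_informative_item(2.0, pool)
print(chosen_low, chosen_mid, chosen_high)
```

In this pool the rule picks the item whose difficulty is closest to the examinee's level at each of the three trait values, which is the behavior an adequate pool is meant to support at every level of ability.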

Table A-2

Estimated Value of the Derivative, Conditional Standard Deviation, and
Value of the Information Function I(θ) for Three Item Pool Configurations,
at Each of 31 Trait Levels (θ)